Entity Resolution of Institutions in Bibliographic Databases
Authors
Abstract
Acknowledgements Many people have assisted me in carrying out this project. Firstly, I would like to thank my academic supervisors, Associate Professor Peter Christen and Dr. Qing Wang, for their ideas, support, encouragement and feedback. I would also like to thank Dr. Paul Wong from the ANU Research Office for providing me with a place to work and helpful advice on the project itself and the SCOPUS database. I would like to thank my friends, in particular Swapnil Mishra and Anish Varghese, for their good humour and for helping me to keep things in perspective. Lastly, I would like to thank my family for their support and encouragement, and for being so understanding about all the family dinners I missed.
Abstract Bibliographic databases are very important for a variety of tasks, including measuring the research output of institutions and predicting future areas of research interest. However, incorrect or incomplete data in such databases can compromise any analysis and lead to poor decision making and financial loss. In this project we have performed data matching of institution data in the SCOPUS Bibliographic Database. We used a variety of established data matching methods and adapted them to suit the particulars of the project. We describe our data cleaning work, including our novel approach for extracting institution names from the values of the organization attribute. We describe the data matching that we have undertaken, both in merging institutions where they have different identifiers but represent the same institution, and in assigning an identifier to records without one. We show that in the first case we can achieve a high coverage and maintain precision over 85%. In the second case the precision drops significantly beyond 40% coverage, and we examine the reasons why this occurs. Finally, we present our conclusions along with some suggestions on how this work could be extended in the future.
Similar resources
Data Cleaning and Matching of Institutions in Bibliographic Databases
Bibliographic databases are very important for a variety of tasks for governments, academic institutions and businesses. These include assessing research output of institutions, performance evaluation of academics and compiling university rankings. However, incorrect or incomplete data in such databases can compromise any analysis and lead to poor decisions and financial loss. In this paper we ...
Comparison of Bibliographic Databases in Retrieving Information on Telemedicine
Background & Aims: Some of the main questions which can be of importance for those researchers who intend to perform a systematic review in a field of science are: ‘What databases should I use for my review?’; ‘Do all these databases have the same value?’; and ‘Which sources retrieved the highest number of relevant references?’. The main aim of this work was the identification of the best database for ...
The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution
This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as a linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Bibliographic Databases: Some Critical Points
Current flow of information necessitates a systematic approach to what authors, reviewers and editors read and use as references. The objectivity of communication is increasingly dependent on a comprehensive literature search through online databases (1). Academic institutions wishing to succeed in the global competition secure access to the prestigious databases and archives (2). Journal edito...